- Roll out and enhance our observability tooling to measure the performance and reliability of our systems
- Help develop our SRE practices and contribute to our product roadmap with the wider Platform team
- Identify and remediate the weaker points of our architecture, using modern fault injection methodologies
- Drive operational excellence and SRE best practices across the engineering group
- Coach and mentor teams in uplifting their use of SRE practices
- Plan “game days” to run chaos experiments for teams across our platform
- Contribute to our team planning, present completed work at showcases, and provide continuous feedback to your team and peers
- Work with key product and technology stakeholders to demonstrate the benefits of SRE
- Prior experience working as a Site Reliability Engineer (or similar role)
- Experience or familiarity with designing and building distributed systems using event sourcing / event-driven architecture (ideally Kafka) and/or APIs
- Experience with one or more container scheduling/orchestration products (e.g. Kubernetes, ECS)
- Experience building systems on any public cloud provider
- Excellent communication skills. You should be comfortable engaging with developers, architects and product owners, and be able to articulate the benefits of SRE practices.
- Experience with at least one observability product (e.g. Datadog, New Relic) and a solid understanding of distributed tracing
- Experience implementing performance commitments for products and experiences (e.g. SLAs/SLOs/SLIs)
- Familiarity with different disaster recovery strategies, load balancing and circuit breaking
- Proficiency in Golang (nice to have)
- Working in a community of industry-leading innovators with a diverse and deep set of skills and experience, you will learn, collaborate, and co-create to achieve great things.
- You’ll have ownership of your role, which will allow you to find the right balance between stretch and sustainability, and between work and life.
- Culture of camaraderie and collaboration - OneTeam
- Plus, with all the tools and learning you need, the tone is set for you to shine and succeed.
- We know that diversity fosters greater innovation and better customer connection, so we strive to create a team where everyone feels like they belong. We support diversity and inclusion, and we are a gender-neutral organisation. We celebrate individuals at their core, so they can shine at their best.
- Access to the latest technologies
- An annual budget for your learning and development (this includes learning of your choice, certifications and more)
- Extended parental leave (16 weeks for primary carers), and paid volunteer days
- Flexible working – we balance working together with working flexibly
- People-focussed culture that celebrates achievements big and small
- Oh, and did I mention paid subscriptions and discount cards?
Company
Location
Melbourne - Australia
Job type
Full-Time
Golang Job Details
Lead Site Reliability Engineer
The Lead Site Reliability Engineer will be focused on developing our SRE Practices and uplifting our observability capability. You will be working with our product development teams in uplifting our existing systems and practices, measuring the quality of experience our customers have across the OnePass products. This is an individual contributor (IC) role.
Key Duties and Deliverables:
What You Will Need
Why OneDigital?
Team Benefits – just to name a few!
To support you at work, what you value in life and in the community, our team member benefits include:
Next steps
If this sounds like your next career move, then click on the ‘Apply’ button now. Please note that we may commence interviewing candidates prior to the application closing date.